Statistical Machine Translation of German Compound Words

نویسندگان

  • Maja Popovic
  • Daniel Stein
  • Hermann Ney
چکیده

German compound words pose special problems to statistical machine translation systems: the occurence of each of the components in the training data is not sufficient for successful translation. Even if the compound itself has been seen during training, the system may not be capable of translating it properly into two or more words. If German is the target language, the system might generate only separated components or may not be capable of choosing the correct compound. In this work, we investigate and compare different strategies for the treatment of German compound words in statistical machine translation systems. For translation from German, we compare linguistic-based and corpusbased compound splitting. For translation into German, we investigate splitting and rejoining German compounds, as well as joining English potential components. Additionaly, we investigate word alignments enhanced with knowledge about the splitting points of German compounds. The translation quality is consistently improved by all methods for both translation directions.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

German Compounds in Factored Statistical Machine Translation

An empirical method for splitting German compounds is explored by varying it in a number of ways to investigate the consequences for factored statistical machine translation between English and German in both directions. Compound splitting is incorporated into translation in a preprocessing step, performed on training data and on German translation input. For translation into German, compounds ...

متن کامل

Morphological Processing of Compounds for Statistical Machine Translation

Machine Translation denotes the translation of a text written in one language into another language performed by a computer program. In times of internet and globalisation, there has been a constantly growing need for machine translation. For example, think of the European Union, with its 24 official languages into which each official document must be translated. The translation of official doc...

متن کامل

Simple Compound Splitting for German

INTRODUCTION • Compound: concatention of two or more words Apfel|baum (apple tree) Apfel|kuchen|rezept|sammlung (apple cake recipe collection) • Productive word formation process → infinite amount of possible compounds • Compound splitting useful for many NLP applications – Statistical Machine translation: translation of new compounds, better lexical coverage – Information retrieval: better gen...

متن کامل

Chasing the Perfect Splitter: A Comparison of Different Compound Splitting Tools

This paper reports on the evaluation of two compound splitters for German. Compounding is a very frequent phenomenon in German and thus efficient ways of detecting and correctly splitting compound words are needed for natural language processing applications. This paper presents different strategies for compound splitting, focusing on German. Four compound splitters for German are presented. Tw...

متن کامل

The pay-offs of preprocessing for German-English statistical machine translation

In this paper, we present the result of our work on improving the preprocessing for German-English statistical machine translation. We implemented and tested various improvements aimed at i) converting German texts to the new orthographic conventions; ii) performing a new tokenization for German; iii) normalizing lexical redundancy with the help of POS tagging and morphological analysis; iv) sp...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006